---
title: Speech-to-Text
description: Learn how to transcribe audio with Agno agents.
---

Agno agents can transcribe audio files using different tools and models. You can use OpenAI's native transcription capabilities, or fully multimodal models like Gemini that accept audio input directly.

<Tip>
  Ways to transcribe audio (audio-to-text) include:
   - [Using Gemini model](/examples/concepts/multimodal/audio-to-text)
   - [Using OpenAI Model](/examples/concepts/agent/multimodal/audio_input_output)
   - [Using `OpenAI Tool`](/concepts/tools/toolkits/models/openai#1-transcribing-audio)
   - [Using `Groq Tool`](/concepts/tools/toolkits/models/groq#1-transcribing-audio)
</Tip>

## Using the OpenAI Transcription API (Cloud)

The following agent uses OpenAI's hosted transcription API, via `OpenAITools` with the `gpt-4o-transcribe` model.

```python cookbook/tools/models/openai_tools.py
from pathlib import Path

from agno.agent import Agent
from agno.tools.openai import OpenAITools
from agno.utils.media import download_file

# Example 1: Transcription
url = "https://agno-public.s3.amazonaws.com/demo_data/sample_conversation.wav"

local_audio_path = Path("tmp/sample_conversation.wav")
print(f"Downloading file to local path: {local_audio_path}")
download_file(url, local_audio_path)

transcription_agent = Agent(
    tools=[OpenAITools(transcription_model="gpt-4o-transcribe")],
    markdown=True,
)
transcription_agent.print_response(
    f"Transcribe the audio file at this path: {local_audio_path}"
)
```

**Best for**: High accuracy, cloud processing
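
Before sending a file to a transcription API, you can sanity-check it locally with Python's stdlib `wave` module. This is a hedged sketch; the `wav_info` helper and the demo file are ours, not part of Agno:

```python
import wave
from pathlib import Path


def wav_info(path: Path) -> dict:
    """Return basic parameters of a WAV file without loading all samples."""
    with wave.open(str(path), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Write a 1-second silent mono WAV just for demonstration
demo = Path("tmp_silence.wav")
with wave.open(str(demo), "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

info = wav_info(demo)
print(info)  # {'channels': 1, 'sample_rate': 16000, 'duration_s': 1.0}
```

Checking sample rate, channel count, and duration up front catches corrupted or truncated downloads before you spend an API call on them.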

## Using Multimodal Models

Multimodal models like Gemini can transcribe audio directly without additional tools.

```python cookbook/agents/multimodal/audio_to_text.py
import requests
from agno.agent import Agent
from agno.media import Audio
from agno.models.google import Gemini

agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    markdown=True,
)

url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/QA-01.mp3"

response = requests.get(url)
response.raise_for_status()  # fail fast if the download did not succeed
audio_content = response.content


agent.print_response(
    "Give a transcript of this audio conversation. Use speaker A, speaker B to identify speakers.",
    audio=[Audio(content=audio_content)],
    stream=True,
)
```

**Best for**: Direct model integration, conversation understanding
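
`Audio(content=...)` takes raw bytes. If an upstream service hands you base64-encoded audio instead (a common transport format), decode it first. A minimal stdlib-only sketch; the payload here is a stand-in string, not real audio:

```python
import base64

# Stand-in for a base64 string received from another service
encoded = base64.b64encode(b"fake-audio-bytes").decode("ascii")

# Decode back to raw bytes before wrapping, e.g. Audio(content=audio_bytes)
audio_bytes = base64.b64decode(encoded)
print(audio_bytes == b"fake-audio-bytes")  # True
```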

## Team-Based Transcription

Teams can handle complex audio processing workflows with multiple specialized agents.

```python cookbook/teams/multimodal/audio_to_text.py
import requests
from agno.agent import Agent
from agno.media import Audio
from agno.models.google import Gemini
from agno.team import Team

transcription_specialist = Agent(
    name="Transcription Specialist",
    role="Convert audio to accurate text transcriptions",
    model=Gemini(id="gemini-2.0-flash-exp"),
    instructions=[
        "Transcribe audio with high accuracy",
        "Identify speakers clearly as Speaker A, Speaker B, etc.",
        "Maintain conversation flow and context",
    ],
)

content_analyzer = Agent(
    name="Content Analyzer",
    role="Analyze transcribed content for insights",
    model=Gemini(id="gemini-2.0-flash-exp"),
    instructions=[
        "Analyze transcription for key themes and insights",
        "Provide summaries and extract important information",
    ],
)

# Create a team for collaborative audio-to-text processing
audio_team = Team(
    name="Audio Analysis Team",
    model=Gemini(id="gemini-2.0-flash-exp"),
    members=[transcription_specialist, content_analyzer],
    instructions=[
        "Work together to transcribe and analyze audio content.",
        "Transcription Specialist: First convert audio to accurate text with speaker identification.",
        "Content Analyzer: Analyze transcription for insights and key themes.",
    ],
    markdown=True,
)

url = "https://agno-public.s3.us-east-1.amazonaws.com/demo_data/QA-01.mp3"

response = requests.get(url)
response.raise_for_status()  # fail fast if the download did not succeed
audio_content = response.content

audio_team.print_response(
    "Give a transcript of this audio conversation. Use speaker A, speaker B to identify speakers.",
    audio=[Audio(content=audio_content)],
    stream=True,
)
```

**Best for**: Complex workflows, multiple processing steps
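
The prompts above ask for speaker-labeled output. If downstream code needs the transcript in structured form, a small parser is enough. A hedged sketch assuming the `Speaker A: ...` line format requested in the prompts (the `parse_transcript` helper is ours):

```python
import re


def parse_transcript(text: str) -> list[tuple[str, str]]:
    """Split 'Speaker X: utterance' lines into (speaker, utterance) pairs."""
    pairs = []
    for line in text.splitlines():
        m = re.match(r"\s*(Speaker [A-Z]):\s*(.+)", line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs


sample = "Speaker A: Hello there.\nSpeaker B: Hi, how are you?"
print(parse_transcript(sample))
# [('Speaker A', 'Hello there.'), ('Speaker B', 'Hi, how are you?')]
```

Because model output formatting can drift, keep the parser tolerant (leading whitespace is allowed here) and validate the result before relying on it.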

## Developer Resources

- View [Multimodal Examples](/examples/concepts/agent/multimodal/audio_to_text)
- View [Team Examples](/examples/concepts/teams/multimodal/audio_to_text)
- View [OpenAI Toolkit](/concepts/tools/toolkits/models/openai)
